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Spectral  methods  have  proven  invaluable  in  numerical  simulation  of  P&fis, 
but  the  fre([uent  global  communication  required  raises  a  fundamental  barrier  to 
their  use  on  highly  parallel  architectures.  To  explore  this  issue,  we  implemented 
a  three  dimensional  implicit  spectral  method  on  an  Intel  hypercube.  Utilization 
of  about  50%  was  achieved  on  a  32  node  iPSC/860  hypercube,  for  a  61  x  64  x  64 
Fourier-spectral  grid;  finer  grids  yield  higher  utilizations. 

Chebyshev-spectral  grids  are  more  problematic,  since  plane- relaxation  based 
multigrid  is  required.  However,  by  using  a  semicoarsening  multigrid  algorithm, 
and  by  relaxing  all  multigrid  levels  concurrently,  relatively  high  utilizations  were 
also  achieved  in  this  harder  case.  In  fact,  since  the  amount  of  work  per  processor 
was  higher  in  this  case,  we  achieved  somewhat  higher  utilization,  typically  60% 
on  moderate  sized  problems.  Thus  spectral  methods  remain  attractive  on  the 
current  generation  of  distributed  memory  architectures. 


* Ttesearch  Kv  ftiiu  ispace  Auiiiimstration  under  NASA  Contract 

NA.Sl-1800.5,  while  the  second  author  was  in  residence  at  ICASE. 
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1.  Introduction.  Spectral  methods  have  proven  of  great  value  in  numer¬ 
ical  simulation  of  turbulence,  numerical  weather  prediction,  acoustics,  and  in  a 
variety  of  other  applications.  Unlike  difference  methods,  spectral  methods  accu¬ 
rately  approximate  all  frequencies  present  on  the  grid,  and  are  thus  well  suited  to 
problems,  like  turbulence,  where  accurate  resolution  of  evolving  solutions  is  crit¬ 
ical.  However,  on  parallel  machines,  the  frequent  global  communication  required 
by  spectral  methods  seems  to  impose  a  fundamental  barrier  to  their  continued 
use.  In  this  paper,  vve  study  this  issue,  in  the  context  of  time  dependent  implicit 
spectral  methods. 

Communication  arises  in  spectral  methods  in  two  principal  ways.  First,  eval¬ 
uation  of  the  spectral  operator  requires  global  communication,  usually  in  multidi- 
mensicn?.!  FFTe.  S>jccr’d,  in  many  problems,  viscous  stability  limits  constrain  the 
time  step  so  severely  that  one  must  resort  to  implicit  methods.  This  holds  true 
especially  for  Chebyshev-spectral  methods,  where  the  close  spacing  ot  collocation 
points  near  boundaries  imposes  severe  stability  limits  on  explicit  methods. 

Given  the  frequent  global  communication  required,  it  is  not  clear  whether 
spectral  methods  remain  attractive  on  parallel  architectures.  That  is,  the  subtle 
numerical  advantages  of  spectral  methods  may  be  completely  swamped  by  the 
high  cost  of  communication  and  synchronization.  This  would  be  unfortunate, 
since  there  are  many  applications  in  which  spectral  methods  are  invaluable[2]. 
Thus  we  undertook  to  study  the  basic  issues  involved  m  implementing  spectral 
methods  on  distributed  memory  machines,  in  order  to  assess  the  extent  to  which 
spectral  methods  remain  competitive  on  these  architectures. 

To  explore  this  issue,  we  designed  and  implemented  an  implicit  spectral 
method  on  the  Intel  iPSC/860  hypercube.  Our  program  performs  a  multigrid- 
based  implicit  solution  of  the  time  dependent,  variable  coefficient  Helmholtz  equa¬ 
tion, 

ut  -  \/a{x,y,z)  ■  -  b{x,y,z)u  +  f{x,y,z,t), 

on  three  dimensional  tensor  product  grids,  a  problem  arising  in  the  Uzawa  for¬ 
mulation  of  the  incompressible  Navier  Stokes  equations  and  in  ocean  circulation 
problems. 

2.  Finite  Element  Preconditioners.  One  can  solve  the  nearly  dense  lin¬ 

ear  systems[2]  arising  in  spectral  methods  in  a  variety  of  ways.  One  of  the 
best  approaches  is  to  use  a  Richardson  or  conjugate  gradient  iteration,  precon¬ 
ditioned  by  inversion  of  a  low  order  finite  element  system.  Suppose  S  is  the 
Fourier-spectral  discrete  Laplacian,  and  let  H  he  a  low  order  finite  difference  or 
finite  clement  operator.  If  H  is  the  standard  5  point  Laplacian  in  two  dimensions, 
the  spectral  condition  number  of  the  preconditioned  spectral  system,  is 

K  —  9  /i7  whil'=  for  bilinear  finite  elements  it  is  1.44. 
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With  this  improved  condition  number,  noted  originally  by  Deville  and  Mund[4], 
the  convergence  rate  of  optimal  parameter  Richardson  iteration,  given  by 

AC  —  1 


drops  from  0.42  to  0.18,  making  Richardson  iteration  quite  effective. 

The  condition  number  of  the  preconditioned  Fourier-spectral  operator  can  be 
further  improved  by  introducing  mass  lumping  into  the  finite  element  discretiza¬ 
tion.  In  one  dimension,  this  amounts  to  replacing  the  finite  element  system 

K  u  =  A/  f 

by  an  analogous  system  with  a  partially  lumped  mass  matrix: 

A7  =  0.95  M  -P  0.05  1 

In  higher  dimensions,  one  gets  the  same  effect  by  tensoring  one  dimensional 
discretizations.  Suppose  one  has  the  improved  one  dimensional  finite  element 
discretizations 


Ki  =  Mi  V 
I<i  u’  =  Mi  f’ 

along  X  and  y  mesh  lines  respectively.  Then  the  analogous  two  dimensional  finite 
element  discretization  is: 

{Ki®Mi  +  Mi®Ki)u  =  Ml®  Mi  f 

The  condition  number  of  the  preconditioned  spectral  system,  based  on  this  im¬ 
proved  finite  element  discretization,  is  1.26,  for  a  Richardson  convergence  rate  of 
0.11.  Since  one  typically  needs  to  reduce  the  initial  residual  by  four  or  five  orders 
of  magnitude  at  each  implicit  time  step,  five  or  six  preconditioned  iterations  are 
needed  per  time  step. 

3.  Fourier-spectral  Grids.  Fourier-spectral  codes  are  used  in  problems 
with  periodic  boundary  conditions.  For  our  model  problem,  at  each  time-step, 
we  form  spectral  re:Idu.'’!s  by  fast  Fourier  transforms,  and  invert  the  spectral 
linear  system  by  a  sequence  of  preconditioned  Richardr:'n  iterations.  Since  the 
convergence  rate  of  the  Richardson  iteration  is  0.1 1,  one  multigrid  V-cycle  suffices 
to  adequately  solve  the  preconditioning  finite  element  system  at  each  iteration. 

Data  Bistriuutiuii.  Tiiegiid.s  need  to  be  clistnouted  across  the  processors  in 
a  way  that  minimizes  communication  and  balances  the  load.  In  this  section,  the 
communication  requirements  of  tw'o  alternate  data  distributions,  which  we  refer 
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levels 


Figure  1:  Multigrid  Using  the  Z-Distribution 

to  as  the  2-  and  the  a:y2-distributions,  are  compared.  In  the  2-distribution,  only 
the  2  direction  is  blocked,  so  adjacent  xy-planes  are  placed  on  each  processor.  In 
the  xy2-distribution,  all  three  coordinate  directions  are  blocked,  so  each  processor 
owns  a  subcube  of  the  grid.  Both  data  distributions  force  communication  in  both 
the  FFT  residual  computation  and  in  the  multigrid  solve. 

With  both  distributions,  we  compute  the  FFTs  in  all  three  coordinate  di¬ 
rections  sequentially  on  processors,  using  global  exchanges  to  bring  all  required 
data  into  the  processors.  Since  each  xy-plane  lies  in  a  single  processor  with  the 
2-distribution,  this  distribution  recjuires  interprocessor  communication  only  for 
the  2-direction  FFT.  In  this  case,  a  complete  exchange  of  data,  where  each  pro¬ 
cessor  sends  a  block  of  data  of  size  n^fp^,  for  an  7i  x  n  x  n  grid,  to  the  p  —  \  other 
processors,  is  'equired. 

The  x|/2-distribution  requires  approximately  twice  as  much  communication 
for  the  spectral  residual  computation.  Communication  is  required  so  that  each 
processor  holds  xy-planes  before  the  FFTs  in  the  x  and  y  directions  arc  calcu¬ 
lated.  In  addition,  before  the  2  direction  is  calculated,  processors  must  hold  yz 
or  xz  planes.  Both  exchanges  can  be  done  with  each  processor  sending  messages 
of  size  n^/  to  —  1  other  processors. 

‘P,  Hncl  ihe  .T7/2-distribution2  also  require  communication  of  bound¬ 
ary  data  after  each  relaxation,  restriction,  and  prolongation  in  the  multigrid 
V-cyclc.  In  the  2-distribution,  processor  i  must  communicate  boundary  planes 
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to  processors  i  +  1  and  i  —  1.  In  the  ry^-distribution,  each  processor  must  com¬ 
municate  boundary  data  to  six  neighboring  processors,  but  the  message  sizes 
are  shorter.  There  is  also  additional  intcrprocessor  communication  required  with 
both  data  distributions,  when  performing  restriction  or  prolongation  operations 
between  levels  having  idle  processors.  Since  there  are  fewer  active  processors  cn 
each  coarser  level,  the  work  of  processors  becoming  idle  needs  to  be  sent  to  the 
remaining  active  processors. 

For  an  n  X  X  77  problem,  the  number  of  grid  levels  in  the  multigrid  V-cycle 
is  log2  n.  The  number  of  grid  points  per  level  is  8‘ ,  where  I  denotes  the  level 
number,  with  /  =  1  the  coarsest  grid.  The  number  of  grid  points  owned  by 
each  processor,  on  the  grid  levels  with  no  idle  processors,  is  8^/p  for  both  data 
distributions.  The  number  of  coarse  grid  levels  containing  idle  processors  for  the 
2-distribution  is  determined  by  logjp.  On  these  levels,  active  processors  contain 
one  xy-plane  with  size  4^  In  the  xy^-distribution,  the  number  of  coarse  grid 
levels  containing  idle  processors  is  determined  by  |  log2  p,  and  on  these  levels, 
active  processors  contain  only  one  grid  point.  Figure  1  shows  the  communication 
required  by  the  2-distribution  during  the  multigrid  V-cycle. 

Using  this  information,  we  computed  the  amount  of  communication  required 
per  Richardson  iteration  for  each  of  the  data  distributions.  The  message  startup 
and  byte  transfer  rate  were  estimated  using  values  reported  for  the  iPSC/860 
by  Bokhari[l].  Data  distribution  in  the  2-direction  led  to  higher  communication 
costs  in  the  multigrid  algorithm,  due  both  to  the  greater  load  imbalance  and 
longer  message  lengths.  However,  the  additional  communication  required  by 
the  syz-distribution  for  the  FFTs  led  to  higher  communication  costs  for  the 
complete  Richardson  iteration.  The  amount  of  communication  is  a  function  of 
both  problem  size  and  the  number  of  processors.  Generally,  for  large  n  and 
p  St  n,  the  xyz-distribution  is  more  efficient,  while  for  moderate  n  and  p,  the 
2-distribution  is  superior.  Based  on  this  analysis,  we  implemented  the  Fourier- 
spectral  code  using  the  2-distribution,  since  it  appeared  to  be  significantly  more 
efhcient  for  our  machine  and  problem  sizes. 

Experimental  Results:  Fourier  Case,  i  he  experiments  were  performed 
on  the  .32-processor  iPSC/S60  at  ICASE.  Using  the  best  current  compiler  (Port¬ 
land  Group),  our  utilization  was  about  65%  for  the  spectral  residual  calculation 
and  about  40%  for  the  multigrid  solution.  The  load  imbalance  and  large  num¬ 
ber  of  cenimunication  steps  in  multigrid  led  to  this  lower  utilization.  Thus  our 
overall  utilization,  including  both  the  residual  calculation  and  multigrid  solution 
was  about  50%. 

4.  Chebyshev  Grids.  Solving  the  spectral  equations  on  Chebyshev  grids 
is  inherently  more  difficult,  since  the  grid  stretching  leads  to  poor  condition 
numbers,  and  since  the  matrix  corresponding  to  the  p.seudospectral  discretization 
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Figure  2:  Concurrent  Multigrid  Data  Distribution 

on  Chebyshev  grids  is  asymmetric.  However,  a  more  serious  issue  is  the  problem 
of  solving  the  finite  element  system  on  highly  stretched  grids.  When  the  mesh 
aspect  ratios 

Ax/Ay,  Ay/Ax 

are  large,  point  relaxation  multigrid  is  inelTective.  Line  relaxation  suffices  to 
resolve  this  problem  in  some  cases;  however,  in  the  general  case,  one  must  use 
plane-relaxation  based  multigrid. 

There  are  two  viable  kinds  of  plane-relaxation  based  multigrid,  algorithms 
employing  plane  relaxation  sweeps  in  all  three  coordinate  directions  and  algo¬ 
rithms  using  plane  relaxation  in  only  one  direction.  The  latter  are  known  as 
‘‘semi-coarsening”  algorithms,  since  the  grid  is  coarsened  in  only  one  coordinate 
direction.  That  is,  if  the  fine  grid  is  an  n  x  n  x  n  grid,  the  next  coarser  grid 
will  be  an  n  X  n  X  n/2  grid,  the  one  after  that  will  be  an  n  x  n  x  n/4  grid,  and 
so  on. 

Semi-coarsening  algorithms  are  cheaper  than  plane  relaxation  algorithms  with 
relaxation  in  all  three  coordinate  directions,  since  plane  relaxation  is  needed 
in  only  one  direction.  They  also  converge  faster  and  are  less  sensitive  to  grid 
stretching[3].  In  addition  to  these  numerical  advantages,  this  algorithm  is  attrac¬ 
tive  for  parallel  computing,  since  the  ^-distribution  allows  the  plane  relaxations 
to  be  carried  out  with  relatively  little  interprocessor  communication. 

Despite  these  advantages,  there  is  an  inherent  problem  with  this  approach; 
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the  sizes  of  the  j)!aii<'s  do  not  decrease  as  one  goes  to  coarser  grids,  leading  to 
vcr\  poor  utilizations.  We  addres.sed  this  issue  In'  using  concurrent  it('ration 
in  which  all  grid  levels  are  simultaneously  relaxed[5].  ('ond)ining  a  concuriv'nt 
relaxation  niultigrid  algorithm  in  the  --direction,  with  a  standard  semi-coarsening 
line  relaxation  algorithm  in  xy-planes,  led  to  a  robust  and  effect i\(  algorithm 
which  is  highly  parallel  and  maps  easily  to  distributed  memory  architectures. 

Experimental  Results:  Chebyshev  Case.  The  Chebyshev  muitigrid  was 
implemented  using  a  concurrent  iteration  muitigrid  scheme  with  red-black  plane' 
relaxation.  Figure  2  illustrates  the  distribution  of  fine  and  coarse  grid  planes 
across  the  processors  and  the  communication  required  during  the  restriction 
phase.  The  prolongation  communication  is  similar  to  that  for  the  restriction. 

For  tlie  Cheb^'shev  case,  we  obtained  processor  utilization  of  a])proximately  60% 
for  a  .'32  x  32  x  32  problem.  Tfie  increa.se  in  utilization  was  due  to  loth  the  im¬ 
proved  load  balance  and  the  increased  ratio  of  com])utation  to  comniunication. 

5.  Conclusions.  .Mapping  implicit  spectral  codes  to  distributed  memory 
architectures  is  difFicult.  VVhile  we  achieved  50%  processor  utilization  on  both  the 
Fourier-spectral  and  Chebyshev-spectral  codes,  this  performance  is  very  sensitive 
to  the  architecture's  communications  capabilities.  If  processor  speeds  wore  to 
increase  by  a  factor  of  ten,  without  a  commensurate  increase  in  communication 
bandwidth,  spectral  methods  would  become  virtually  unusable. 

As  can  be  seen  from  our  results,  we  obtained  reasonable  processor  utiliza¬ 
tion,  despite  the  relatively  small  size  of  problems  considered,  without  extensive 
program  “tuning.”  With  larger  problems,  having  perhaps  1024  mesh  points  in 
every  direction,  we  would  expect  to  achieve  75%  processor  utilization  on  hy¬ 
percubes  having  a  few  thousand  processors,  assuming  the  present  communica¬ 
tion/computation  speed  ratio.  The  amount  of  exploitable  parallelism  on  this 
class  of  applications  is  really  very  large. 

To  achieve  high  utilization  on  machines  having  thousands  of  processors  will 
require  several  improvements  in  our  algorithm.  First,  some  level  of  overlap  of 
communication  and  computation  is  necessary.  While  this  is  trivial  in  jirinciple, 
it  entails  extensive  programming  changes.  Second,  alternate  ways  of  distributing 
the  computation  on  the  hypercube  needed  to  be  explored.  While  our  approach 
of  distributing  the  data  in  only  the  z-direction  is  optimal  in  some  cases,  it  ex¬ 
acerbates  the  mtdtigrid  “idle  processor”  problem  on  coarse  grids  and  increases 
total  communication.  Thus  it  may  be  better  to  use  hybrid  decompositions,  in 
which  .-^ome  grid  levels  are  decomposed  in  one  way  and  others  in  other  ways. 
Third,  new  variants  of  multigrid[7],  based  on  the  use  of  multiple  "-oarse  grids’ 

'W.  Hackbusch  is  also  exploring  the  use  of  multiple  coarse  grids  to  obtain  robustness  (p('rson 
roriiniunication). 
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promise  to  eliniinatc  the  need  for  line  and  plane  relaxation  altogether,  allowinp^ 
much  higher  levels  of  parallelism.  This  is  probably  tiiO  most  fruitful  direction  for 
future  research  in  this  area. 

Exploring  these  issues  is  interesting,  but  rather  awkward  at  the  moment,  with 
the  current  Intel  software  environment.  The  availability  of  better  programming 
environments,  such  as  the  Kali.  Dino,  and  Fortran  D  languages[b,  8]  should  dra¬ 
matically  ease  exploration  of  such  alternatives. 
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Final  Report 
16  Abstract 

Spectral  metnods  have  proven  invaluable  in  numerical  simulation  of  PDEs,  but 
the  frequent  global  communication  required  raises  a  fundamental  barrier  to  their 
use  on  highly  parallel  architectures.  To  explore  this  issue,  we  implemented  a 
three  dimensional  implicit  spectral  method  on  an  Intel  ’■wpercube.  Utili^.ation 
of  about  50Z  was  achieved  on  a  32  node  iPSC/860  hypercubc,  for  a  64  x  64  x  64 
Fourier-spectral  grid;  finer  grids  yield  1  Igher  utilizations. 

Chebyshev-spectral  grids  are  more  problematic,  since  plane-relaxation  based 
multigrid  is  required.  However,  by  using  a  semicoarsening  mtiltigrid  algoritL.ii, 
and  by  relaxing  all  multigrid  1'  vels  concurrently,  relatively  high  utilizations 
were  also  achieved  in  this  harder  case.  In  fact,  since  the  amount  of  work  per 
processor  was  higher  in  this  case,  we  achieved  somewhat  higher  utilization, 
typically  60%  on  moderate  sized  problems.  Thus  spectral  me'-hr.'.s  remain  attractive 
on  the  current  generation  of  distributed  memory  architectures. 
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